Attention Is All You Need
Paper available at: https://arxiv.org/abs/1706.03762
Recap:
- Traditionally, NLP tasks were approached using RNNs
- Information flows through hidden states, which have to retain everything seen so far
- The decoder produces the next word by predicting it from the hidden state
- Inefficient because of the number of sequential steps and computations (see the sketch after this list)
- RNNs have a hard time learning long-term dependencies
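
The sequential bottleneck is easy to see in code. Below is a minimal sketch (plain NumPy; the toy dimensions and weights are my own illustrative choices) of a vanilla RNN encoder: every token needs one step that depends on the previous hidden state, so the loop cannot be parallelised and gradients have to flow back through all of it.

```python
import numpy as np

# Toy dimensions, chosen only for illustration
seq_len, d_in, d_hid = 10, 8, 16
rng = np.random.default_rng(0)

W_xh = rng.normal(size=(d_in, d_hid)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden -> hidden
x = rng.normal(size=(seq_len, d_in))          # one input sequence

h = np.zeros(d_hid)
for t in range(seq_len):
    # Each step depends on the previous hidden state: the sequence is
    # processed strictly one token at a time, and all information about
    # the past must survive inside `h`.
    h = np.tanh(x[t] @ W_xh + h @ W_hh)

print(h.shape)  # (16,) - the final hidden state summarising the whole sequence
```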
Attention:
- The decoder can decide to attend to the encoder's hidden states from earlier positions
- There is no need to push everything through the whole chain of hidden states
- Produces Keys (K) and Queries (Q)
- Keys index the hidden states via a softmax over the Query-Key scores (a minimal sketch follows this list)
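
The paper's scaled dot-product attention makes the Key/Query/Value roles concrete. A minimal NumPy sketch; the shapes and the softmax helper are my own choices, the formula softmax(QK^T / sqrt(d_k)) V is Eq. (1) of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 1 in the paper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # how well each Query matches each Key
    weights = softmax(scores, axis=-1)              # soft "index" over the Values
    return weights @ V, weights

# Toy example: 3 queries attending over 5 key/value pairs of width 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (3, 4) (3, 5); each row of w sums to 1
```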
Attention is a major paradigm shift:
- The source sentence is fed into the encoder inputs
- The target sentence is fed in as the outputs (shifted right) for the decoder
- The output is a probability distribution over the next word
- In contrast to an RNN, each token is produced in a single forward step; there is no backpropagation through many time steps (see the layout sketch below)
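
To make the input/output contract concrete, here is a minimal sketch of how one training example is laid out; the token IDs and the BOS/EOS conventions below are illustrative assumptions, not taken from the paper.

```python
# One training example, laid out as the Transformer consumes it.
BOS, EOS = 1, 2  # assumed special-token IDs

src_tokens = [5, 17, 23, 9, EOS]        # source sentence -> encoder input
tgt_tokens = [11, 42, 7, EOS]           # target sentence

decoder_input = [BOS] + tgt_tokens[:-1] # target shifted right -> decoder input
labels        = tgt_tokens              # the model outputs a probability
                                        # distribution over the next token
                                        # at every position at once

for inp, lab in zip(decoder_input, labels):
    print(f"decoder sees {inp:3d} -> should predict {lab:3d}")
# All positions are predicted in one forward pass (with a causal mask),
# so there is no backpropagation through a chain of time steps.
```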
Multi-Head Attention:
- Use attention over the input sequence (sentence)
Combining the source sequence with the target sequence via multi-head (encoder-decoder) attention:
- The encoder of the source sentence discovers interesting things & builds Key-Value pairs
- The decoder, processing the target sentence, builds the Queries
- The Values of the source sentence are indexed using the Keys
- The Query expresses what information the network is looking for (a sketch follows this list)
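
A hedged NumPy sketch of this encoder-decoder multi-head attention: Queries come from the target side, Keys and Values from the source side. The head count, widths, and weight initialisation are toy choices of mine, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, n_heads):
    """Queries from x_q (target side), Keys/Values from x_kv (source side);
    each head works on a slice of width d_model // n_heads."""
    d_model = x_q.shape[-1]
    d_head = d_model // n_heads

    def split_heads(x):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return x.reshape(x.shape[0], n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(x_q @ W_q)
    K = split_heads(x_kv @ W_k)
    V = split_heads(x_kv @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, tgt_len, src_len)
    weights = softmax(scores, axis=-1)                    # one soft index per head
    heads = weights @ V                                   # (heads, tgt_len, d_head)

    concat = heads.transpose(1, 0, 2).reshape(x_q.shape[0], d_model)
    return concat @ W_o                                   # final linear projection

# Toy shapes: source of length 6, target of length 4, d_model=8, 2 heads
rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
src = rng.normal(size=(6, d_model))   # encoder output (Keys/Values come from here)
tgt = rng.normal(size=(4, d_model))   # decoder states (Queries come from here)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
print(multi_head_attention(tgt, src, W_q, W_k, W_v, W_o, n_heads).shape)  # (4, 8)
```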
Advantages:
- Reduction in the number of sequential computation steps
- Shorter paths between distant positions (shorter maximum path length)
- Implications for other machine learning fields (computer vision etc.)
Look at: